RNAseq analysis mini project.

#Loading the data set and preparing it for the analysis. The required packages were loaded.

suppressMessages(library("DESeq2"))
suppressMessages(library("tidyverse"))
suppressMessages(library("recount"))
suppressMessages(library("ggpubr"))
suppressMessages(library("EnhancedVolcano"))
suppressMessages(library("EnsDb.Hsapiens.v86"))
suppressMessages(library("msigdbr"))
suppressMessages(library("clusterProfiler"))
suppressMessages(library("PoiClaClu"))
suppressMessages(library("pheatmap"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("org.Hs.eg.db"))
suppressMessages(library("DT"))

For this project, I wanted to use RNAseq dataset to compare treated and untreated conditions. So I created a project info to search for treated control models into recount database. I found the SRP065849 fit my description, so I downloaded count data of it.

project_info <- abstract_search(query = "treated control")
View(project_info)
selected_study <- "SRP065849"
download_study(selected_study)

The count data was loaded. The read count matrix and sample metadata was checked if they are ready for analysis. I noticed the condition data was not present, so I have to add it manually. I obtained the condition details from the link below and added as a condition column to the rse_gene dataframe.

From here the condition details were obtained for this experiment. https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP065849&o=acc_s%3Aa

I also noticed that the gene IDs were not in correct format to search, so the gene ids were corrected by deletin the dots present in the IDs.

load("SRP065849/rse_gene.Rdata")

cts <- assay(rse_gene)
View(cts)


cd <- as.data.frame(colData(rse_gene))
View(cd)

rse_gene$condition <- c(rep("treated",5), rep("untreated", 4))
cd <- as.data.frame(colData(rse_gene))
View(cd)

rownames(rse_gene) <- gsub(rownames(rse_gene), 
                           pattern = "\\..+", 
                           replacement = "")

PART 1 :DEGs analysis using DESeq2

1: Quality Control

DESeq data set was created as dds, by setting condition as design. Design means which property you want to compare on your samples. After that a heatmap plotted to visulize Poisson distances between samples. This plot tells us whether this count data will give the answers we need or not. In this case, if there is differential gene expression present between the treated and untreated samples, we expect to see color difference between treated and untreated samples in the heatmap.

I selected to use Poisson distance because the data was in raw form (non-normalized). If it was a normalized dataset, I’d have to use Euclidean distance.

As you can see in the heatmap, color differince present. This means, we can perform differential expression analysis in this dataset.

dds <- DESeqDataSet(rse_gene, design = ~ condition)

poisd <- PoissonDistance(t(counts(dds)))
samplePoisDistMatrix <- as.matrix( poisd$dd)
rownames(samplePoisDistMatrix) <- paste(dds$condition, sep=" - " )
colnames(samplePoisDistMatrix) <- NULL
colors = colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(samplePoisDistMatrix,
         clustering_distance_rows = poisd$dd,
         clustering_distance_cols = poisd$dd,
         col = colors)

When we’re working with DEGs analysis, first thing we need to do is to transform the data. I applied rlog transformation to dds dataset and saved as rld object. Then PCA plot was plotted. The variance in PCA plot indicates how strong the affects such as batch effect, contamination, etc. on the data.

rld <- rlog(dds)
plotPCA(rld, intgroup = c("condition"))

2: Running differential gene expression analysis

Differential gene expression analysis was performed and results were saved as results. The contrast in the results indicates that I want get the fold change of treated sample over control sample.

dds <- DESeq(dds)
results <- results(dds, contrast = c("condition","untreated","treated"))

MA plot was plotted to examine how the fold change related to mean of each genes that expressed. Notice that the lines near the axis. They indicates that low expressed genes with high fold changes. But they’re not significant. In order to get a more accurate plot for the fold change, we need to get rid of them statistically.

plotMA(results, ylim = c(-5, 5))

So, for this purpose LFC shrink normalization applied and saved as resNorm object. A MA plot was plotted again for normalized data. After that, I saved the resNorm again as data frame into results object.

3: Normalization of the data

resNorm <- lfcShrink(dds = dds, res = results, type = "normal", coef = 2)

plotMA(resNorm, ylim = c(-5,5))

results <- as.data.frame(resNorm)
View(results)

4: Addition of gene annotions to results and plotting a volcano plot

After that steps, we need to add gene annotions to the results, and saved the results as anno_results.

anno <- AnnotationDbi::select(org.Hs.eg.db, rownames(results), 
                              columns=c("ENSEMBL", "ENTREZID", "SYMBOL","GENENAME"), 
                              keytype="ENSEMBL")

anno_results <- as.data.frame(results) %>%
  rownames_to_column(var = "ENSEMBL") %>%
  left_join(anno, by = "ENSEMBL")

Table 1: DEGs analysis results table

datatable(anno_results,options=list(scrollX = TRUE,
                         scrollY = TRUE),
            caption = 'Table 1: Enriched Pathways',
            class = "nowrap display")

A volcano plot was plotted. In order to make interpretations to this data, we need to identify significant genes.

EnhancedVolcano(anno_results, 
                lab = anno_results$SYMBOL,
                x = "log2FoldChange", 
                y="padj", 
                border = "full", 
                borderWidth = 1.5, 
                borderColour = "black", 
                gridlines.major = FALSE, 
                gridlines.minor = FALSE, 
                title = "Untreated versus treated")

5: Seperating significant gene sets and plotting a heatmap

We can obtain significant genes by taking the genes that have padj value less than 0.01, log2FoldChange greater than or equal to 1 and, base mean greater than or equal to 20. I saved the results as significant object and printed the head of significant object to examine.

After that we need to order the significant genes and set the get the top DEGs with the helpp of the ids, so we set the ENSEMBL, SYMBOL ids of significant genes as id1 and id2 objects. Then we subset the matrix by id1, which collects DEGs into DEGs objct by taking ids (id1) as base. Then symbols (id2) added as rownames of DEGs. After that top 20 DEGs were obtain by taking 20 rows of the DEGs.

A heatmap was plotted to visualize top 20 differentially expressed genes.

significant <- anno_results[which(anno_results$padj < 0.01 &
                          abs(anno_results$log2FoldChange) >= 1 &
                                    anno_results$baseMean >= 20), ]

head(significant[order(significant$log2FoldChange ), ] )
##               ENSEMBL    baseMean log2FoldChange     lfcSE      stat
## 38558 ENSG00000248939   1900.9167      -7.304877 0.3403203 -12.97524
## 10848 ENSG00000162631 154439.6498      -4.848216 0.2548772 -19.11121
## 51958 ENSG00000271447  10767.3761      -4.418288 0.2503990 -17.69466
## 31440 ENSG00000232896    890.1011      -4.384182 0.3859359 -11.09997
## 6479  ENSG00000131386  53712.9171      -4.341699 0.2237796 -19.43618
## 30238 ENSG00000231156    389.8445      -4.321078 0.3852984 -10.15897
##             pvalue         padj ENTREZID  SYMBOL
## 38558 1.690735e-38 1.199452e-35     <NA>    <NA>
## 10848 2.036871e-81 2.203640e-77    22854   NTNG1
## 51958 4.609717e-70 1.994855e-66    79148   MMP28
## 31440 1.254859e-28 4.242501e-26     <NA>    <NA>
## 6479  3.815190e-84 8.255118e-80   117248 GALNT15
## 30238 3.022537e-24 6.475263e-22     <NA>    <NA>
##                                               GENENAME
## 38558                                             <NA>
## 10848                                        netrin G1
## 51958                       matrix metallopeptidase 28
## 31440                                             <NA>
## 6479  polypeptide N-acetylgalactosaminyltransferase 15
## 30238                                             <NA>
mat <- assay(rld)
orderedSig <- significant[order(significant$padj), ]
id1 <- significant$ENSEMBL
id2 <- significant$SYMBOL
DEGs <- mat[id1,]

rownames(DEGs) <- id2
top20DE <- head(DEGs, n=20)


pheatmap(top20DE, 
         scale = "row", 
         clustering_distance_rows = "correlation",
         main="Top 20 Differentially Expressed genes")

Table 2: Top 20 DEGs

datatable(top20DE,options=list(scrollX = TRUE,
                         scrollY = TRUE),
            caption = 'Table 2: Top 20 differentially expressed genes (DEGs)',
            class = "nowrap display")

PART 2: GSEA - Gene Set Enrichment Analysis

1: Preparation and filtering the data by adding GSEA metric

To make Gene Set Enrichment Analysis (GSEA), first we need to rank all of the genes. But when we add gsea_metric column, it contains infinite numbers. So to get rid of the infinite numbers, padj column added. It basically changes padj to smallest number of padj anywhere it gets 0. After that, rows with missing values removed and the data was arranged by gsea_metric in descending order. The filtered results saved as resdf object.

resdf <- anno_results %>%
  arrange(padj) %>%
  mutate(gsea_metric = -log10(padj) * sign(log2FoldChange)) %>%
  mutate(padj = case_when(padj == 0 ~ .Machine$double.xmin,
                          TRUE ~ padj)) %>%
  filter(! is.na(gsea_metric)) %>%
  arrange(desc(gsea_metric))

Table 3: GSEA Data set

datatable(resdf,options=list(scrollX = TRUE,
                         scrollY = TRUE),
            caption = 'Table 3: GSEA data',
            class = "nowrap display")

2: Runing GSEA

To run GSEA we need ranked vector of the GSEA datas and gene sets. So GSEA datas were ranked and saved as a vector into the ranks object. Gene sets was obtained from msigdbr package and saved as gene_sets object. And GSEA was performed. The results saved as data frame into gsearesdf object and printed. Dotplot was plotted.

gene_sets <- msigdbr(species = "Homo sapiens", category = "C5")
gene_sets <- gene_sets %>%
  dplyr::select(gs_name, gene_symbol)

ranks <- resdf %>%
  select(SYMBOL, gsea_metric) %>%
  distinct(SYMBOL, .keep_all = TRUE) %>%
  deframe()

gseares <- GSEA(geneList = ranks, 
                TERM2GENE = gene_sets)

gsearesdf <- as.data.frame(gseares)


dotplot(gseares)

3: Ploting top 5 over-expressed pathways

So in order to make a plot that nicely understandable top 5 of over-expressed pathways was extracted from the gsearesdf data frame as top_pathways and gseaplot was plotted for each pathway. Plots were saved as GSEA_up.png

top_pathways <- gsearesdf %>%
  top_n(n = 5, wt = NES) %>%
  pull(ID)

top_pathway_plots <- lapply(top_pathways, function(pathway) {
  gseaplot(gseares, geneSetID = pathway, title = pathway)})

top_pathway_plot <- ggarrange(plotlist = top_pathway_plots,
                              ncol = 3, nrow = 2, labels = "AUTO")

ggsave(top_pathway_plot, filename = "GSEA_up.png",
       scale = 1.25,
       height = 11, width = 18)

Table 4 : Top 5 over-expressed pathways.

  gsearesdf %>%
  top_n(n = 5, wt = NES) %>%
datatable(options=list(scrollX = TRUE,
                         scrollY = TRUE),
            caption = 'Table 4: Top 5 over-expressed pathways.',
            class = "nowrap display")
Gseaplots for top 5 over-expressed pathways
GSEA_up.png

GSEA_up.png

4: Ploting top 5 under-expressed pathways

The same methodology was applied to plot Gseaplots for top 5 under-expressed pathways.

bottom_pathways <- gsearesdf %>%
  top_n(n = 5, wt = -NES) %>%
  pull(ID)

bottom_pathway_plots <- lapply(bottom_pathways, function(pathway) {
  gseaplot(gseares, geneSetID = pathway, title = pathway)})

bottom_pathway_plot <- ggarrange(plotlist = bottom_pathway_plots,
                                 ncol = 3, nrow = 2, labels = "AUTO")

ggsave(bottom_pathway_plot, filename = "GSEA_down.png",
       scale = 1.5,
       height = 11, width = 18)

Table 5 : Top 5 under-expressed pathways.

gsearesdf %>%
  top_n(n = 5, wt = -NES) %>%
datatable(options=list(scrollX = TRUE,
                         scrollY = TRUE),
            caption = 'Table 5: Top 5 under-expressed pathways.',
            class = "nowrap display")
Gseaplots for top 5 under-expressed pathways
GSEA_down.png

GSEA_down.png

Table 6: Enriched pathways table

sig_gsea <- gsearesdf %>%
  filter(p.adjust <= 0.05) %>%
  arrange(desc(NES)) %>%
  select(NES, pvalue, p.adjust, core_enrichment) %>%
  mutate(pvalue = format(pvalue, digits = 5, 
                         scientific = TRUE),
         p.adjust = format(p.adjust, digits = 5, 
                           scientific = TRUE),
         NES = signif(NES, 5)) 
  
datatable(sig_gsea,options=list(scrollX = TRUE,
                         scrollY = TRUE),
            caption = 'Table 6: Enriched Pathways',
            class = "nowrap display")

PART 3: Conclusion

So we can answer these questions with the results : How does treatment affects the cell behavior?

In the over-expressed pathways, we see “epidermal cell differentiation” and “skin development” pathways. But in under-expressed pathways, these pathways were not present. So we can say that treatment suppressed epidermal cell differentiation and skin development pathways. Also, in the experiment, they say that some of the treated samples show resistance to treatment. We see “mitotic spindle organization” and “nucleosome assembly” in under-expressed pathways (treated). These pathways may be the cause of resistance to treatment. New experiments can be conducted to investigate the affect of these pathways. In addition, we can apply gene enrichment analysis based drug repurposing to target specific pathways enriched in melanoma.